{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "eb869c3b-6e78-411e-b48b-55f7e4e5fbe3",
   "metadata": {},
   "source": [
    "# Preprocess coding for:<br>**Immune Signatures of the First Living Human Recipient of a Gene-Edited Pig Kidney**\n",
    "\n",
    "*Authors: Guilherme T. Ribas, André F. Cunha, Jonathan P. Avila, Alessia Giarraputo, Leela Morena, Karina Lima, Rodrigo B. Gassen, Jia-Yun Chen, Jia-Ren Lin, Sandro Santagata, Claire T. Avillach, Birgitta A. Ryback, Martin S. Lindner, Sivan Bercovici, Ivy A. Rosales,\n",
    "Tatsuo Kawai, Helder I. Nakaya, R. B. Colvin, Thiago J. Borges, Leonardo V. Riella*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6f9fbec-8e92-43f1-a5db-106757e1f610",
   "metadata": {},
   "source": [
    "## Downloading public dataset (GSE120649)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a4a2267-fa8c-417d-a260-12a99d0973dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "fasterq-dump --split-files -A ${SRR_code} -O ./public_raw_reads/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c5d6024-8eda-44d5-bf9c-9b9c73daa9a5",
   "metadata": {},
   "source": [
    "## Trimming adapters with Fastp v0.24.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "327673d7-370d-4ec7-87c3-d05b3c94af1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "fastp -i ${file}.fastq -o ${out_file}${file}.trimmed.fastq"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53c8c1f3-1a6b-4fe1-ae8a-6a373411af34",
   "metadata": {},
   "source": [
    "## Downloading reference Transcriptome and Annotation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "968c7e5c-7d67-4175-b23a-7efee7f14713",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.transcripts.fa.gz\n",
    "wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.basic.annotation.gtf.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cf5767a-7c31-40a7-8fca-52ed99dddf81",
   "metadata": {},
   "source": [
    "## Estimating Abundance with Salmon"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c8b1109-0b1e-4896-a210-f54f64e2dc9e",
   "metadata": {},
   "source": [
    "### Indexing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21a6a398-cd03-42d3-8d11-0b18a13ac01b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "salmon index -i gencode.v46.transcripts_salmon_gencode.idx -t gencode.v46.transcripts.fa.gz --gencode"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cdbd909-b0c1-4671-9e33-c5857c7b9a66",
   "metadata": {},
   "source": [
    "### Quantifying"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "361accbc-8b27-4f7d-9de8-f7b4781d0890",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "salmon quant -i gencode.v46.transcripts_salmon_gencode.idx -l A -r ${out_file}${file}.trimmed.fastq -p 8 --validateMappings -o ${out_salmon_file}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c46ce1f0-a7cf-460c-b4ea-97382d3773cd",
   "metadata": {},
   "source": [
    "## Import and summarize transcript-level to gene-level with Tximport"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc7fc925-e69c-45b4-affe-0f41f057f4ce",
   "metadata": {},
   "source": [
    "### Preparing tx2gene file to tximport argument.\n",
    "\n",
    "<span style=\"color:red\">**Obs:** If you use this code in future publications, please, **refer this paper or the code DOI (https://doi.org/10.7910/DVN/HMKPXS).</style>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef534370-e8b6-41a7-b2b5-38fa250966cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gzip\n",
    "\n",
    "ann_file = 'gencode.v46.basic.annotation.gtf.gz'\n",
    "with open(ann_file, 'r') as gtf:\n",
    "    transcript_id_list = []\n",
    "    gene_id_list = []\n",
    "    gene_name_list = []\n",
    "    gene_type_list = []\n",
    "    gene_length_list = []\n",
    "\n",
    "    for lines in gtf:\n",
    "        line = lines\n",
    "        if not line.startswith('#'):\n",
    "            a = line.split('\\t')\n",
    "            if a[2] == 'gene':\n",
    "                for e in a[8].split(';'):\n",
    "                    if 'transcript_id' in e:\n",
    "                        transcript_id = e.replace('\"','').split(' ')[-1]\n",
    "                    if 'gene_id' in e:\n",
    "                        gene_id = e.replace('\"','').split(' ')[-1]\n",
    "                    if 'gene_name' in e:\n",
    "                        gene_name = e.replace('\"','').split(' ')[-1]\n",
    "                    # if 'gene_type' in e:\n",
    "                    if 'gene_biotype' in e:\n",
    "                        gene_type = e.replace('\"','').split(' ')[-1]\n",
    "                transcript_id_list.append('')#transcript_id)\n",
    "                gene_id_list.append(gene_id)\n",
    "                gene_name_list.append(gene_name)\n",
    "                gene_type_list.append(gene_type)\n",
    "                gene_length_list.append(str(int(a[4])-int(a[3]) + 1))\n",
    "\n",
    "with open('./gene_id2gene_nameWgene_lenght_biotype.csv', 'w') as gene_id2gene_name:\n",
    "    gene_id2gene_name.write('Gene_name,Gene_ID,Gene_type,Gene_lenght\\n')\n",
    "    for i,g_id in enumerate(gene_id_list):\n",
    "        gene_id2gene_name.write(gene_name_list[i]+','+g_id+','+gene_type_list[i]+','+gene_length_list[i]+'\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21504c3e-1a63-43fc-818b-906c0ac68065",
   "metadata": {},
   "source": [
    "### Importing Salmon abudance data with tximport"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c7f2d7e-7c96-4b19-8349-afc900516f53",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%R\n",
    "library(\"tximport\")\n",
    "library(\"readr\")\n",
    "library(\"DESeq2\")\n",
    "\n",
    "tx2gene = read_csv('gene_id2gene_nameWgene_lenght_biotype.csv')\n",
    "list_quants_files = list.files('./out_salmon_file/')\n",
    "\n",
    "txi = tximport(files, type=\"salmon\", tx2gene=tx2gene)\n",
    "\n",
    "# Get only protein_coding genes\n",
    "pc = tx2gene[tx2gene$Gene_type=='protein_coding',]$Gene_name\n",
    "txi$counts = txi$counts[rownames(txi$counts) %in% pc,]\n",
    "txi$abundance = txi$abundance[rownames(txi$abundance) %in% pc,] \n",
    "txi$length = txi$length[rownames(txi$length) %in% pc,] "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c98e1b0-525e-4588-9570-3721fa0b89b7",
   "metadata": {},
   "source": [
    "### Normalizing Abudance via DESEQ2::vst function "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3fad9e7b-6919-497e-8353-c7dabcfa009c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create colData just to run DESeqDataSetFromTximport.\n",
    "# We don't need the real metadata since we are running to normalize the data to VSD\n",
    "# and not to run differential expression analysis.\n",
    "colData = matrix(c(1,0), nrow = length(list_quants_files), ncol = 2, byrow = TRUE)\n",
    "colnames(colData) = list_quants_files\n",
    "\n",
    "ddsTxi <- DESeq2::DESeqDataSetFromTximport(txi,\n",
    "                                           colData = colData,\n",
    "                                           design = ~1)\n",
    "dds = DESeq2::DESeq(ddsTxi)\n",
    "colnames(dds) = metadata$samples\n",
    "\n",
    "# Get VSD normalized expressions\n",
    "vsd = vst(dds, blind=FALSE)\n",
    "\n",
    "write.csv(assay(vsd), file = 'xeno_bulk_exprs_vsd.csv', quote = FALSE)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:DSenv]",
   "language": "python",
   "name": "conda-env-DSenv-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
